ABSTRACT

Though item bias statistics are widely recommended 
for use in test development and analysis, problems arise in their 
interpretation. This research evaluates logistic test models and 
computer simulation methods for providing a frame of reference for 
interpreting item bias statistics. Specifically, the intent was to 
produce simulated sampling distributions of item bias statistics
under the no-bias hypothesis, for use in determining cut-off points 
to provide guidelines for interpreting item bias statistics obtained 
with actual test data. In this case, potential sex bias was studied 
in the item responses of 937 Cleveland ninth graders to 75 items from 
the 1985 Cleveland Reading Competency Test. Results supported the 
basic data simulation approach used in the study. Real and simulated 
distributions for three item bias statistics (area between 
characteristic curves, root mean squared differences between curves, 
and the Mantel-Haenszel statistic) when bias was not present were 
very similar. The minor differences found between the distributions 
had little effect on the interpretation of item bias statistics 
obtained with actual data. Seven steps for applying the method of 
computer-simulated baseline statistics in test development settings 
were outlined. (Author/LPG)

Abstract

Though item bias statistics are widely recommended for use in test
development and test analysis work, problems arise in their
interpretation. The purpose of the present research was to evaluate
the value of logistic test models and computer simulation methods for
providing a frame of reference for item bias statistic interpretations.
Specifically, the intent was to produce simulated sampling
distributions of item bias statistics under the hypothesis of no bias for use
in determining cut-off points to provide guidelines for interpreting
item bias statistics obtained with actual test data.

The results provided support for the basic data simulation
approach used in the study. Real and simulated distributions for three
item bias statistics when bias was not present were very similar and
the minor differences that were found between the distributions had
little effect on the interpretations of item bias statistics obtained
with actual test data. Seven steps for applying the method of
computer-simulated baseline statistics in test development settings
were outlined in the paper.






Evaluation of Computer Simulated Baseline Statistics
for Use in Item Bias Studies

H. Jane Rogers and Ronald K. Hambleton 
University of Massachusetts, Amherst 

The great public concern in this country over unfairness or bias 
in testing has resulted in substantial numbers of research studies that 
have described and evaluated new methods for identifying potentially 
biased test items (Berk, 1982; Shepard, Camilli and Averill, 1981; 
Shepard, Camilli and Williams, 1985). Most of the new methods are based
upon item response models and related procedures and involve the 
calculation of statistics which are unfamiliar to test developers 
(e.g., weighted b value differences, area between two item 
characteristic curves, sum of squared differences between two item 
characteristic curves). 

One problem that has arisen in test development work concerns the 
interpretations of these new item bias statistics. Certainly the 
statistics, whatever their interpretation, can be used to rank-order 
test items to identify the items of most and least concern. But test 
developers often want to sort test items into ordered categories (e.g., 
"must be very carefully reviewed", "may need revision", "should be 
acceptable") and for this purpose, critical values or cut-off points
for classifying the item bias statistics would be useful. The
advantage of a classificatory approach, as opposed to an approach based
upon item rankings, is that the number of potentially biased items does
not need to be specified in advance of the analysis. Thus the number
of items identified as potentially biased would depend on the dataset. 
Of course the main difficulty in placing items into categories is 
determining a frame of reference and subsequently cut-off scores for 
interpreting the IRT item bias statistics of interest. 

The purpose of the present research was to evaluate the value of 
logistic test models and computer simulation methods for generating 
sampling distributions of item bias statistics under the hypothesis of 
no item bias, for use in determining cut-off points to provide 
guidelines for interpreting item bias statistics. 

This study was prompted by some earlier research by Hambleton, 
Rogers and Arrasmith (1986). These authors used baseline data for 
interpreting item bias statistics which were provided by two randomly 
equivalent majority samples, and two randomly equivalent minority 
samples, obtained from real data. This meant that while meaningful 
baseline results were available, the important comparisons between the 
majority and minority groups were carried out with sample sizes half 
the size of those sample sizes that were actually available. Reduction 
of sample sizes by 50% to obtain baseline information is a high price 
to pay when sample sizes are often not very large to begin with. Small 
sample item bias studies are especially problematic when IRT methods 
are used. The results from the Hambleton et al. study showed that
logistic models could be used to provide simulated results to serve as 
a baseline for interpreting item bias statistics. But it was also 
clear that more research was needed to strengthen their conclusion.

Another way that item bias baseline statistics might be compiled
is by combining the majority and minority groups of interest and then
conducting an item bias investigation using two randomly equivalent
samples drawn from the combined sample (Shepard, Camilli, & Williams,
1984; Wilson-Burt, Fitzmartin, & Skaggs, 1986). Since item bias should
not be present in two randomly equivalent groups, the distribution of 
item bias statistics obtained in two randomly equivalent groups could 
serve as a basis for setting cut-off scores for interpreting item bias 
statistics in the majority and minority samples. 
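
The random-halves idea can be illustrated with a minimal Python sketch. The function name, the seed, and the use of examinee index numbers are assumptions made for illustration; only the total of 937 examinees comes from the study.

```python
import random

def split_random_halves(examinee_ids, seed=0):
    """Shuffle a combined sample and cut it into two randomly equivalent halves."""
    rng = random.Random(seed)      # fixed seed so the split can be reproduced
    shuffled = list(examinee_ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Combined majority and minority sample, identified here by index only.
half_a, half_b = split_random_halves(range(937))
```

Because group membership plays no part in the split, any item bias statistic computed between the two halves reflects sampling error alone, which is what makes the resulting distribution usable as a baseline.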

The main shortcoming of this approach, and it is a shortcoming of 
the early Hambleton et al. (1986) work too, is that any difference in
the ability distributions between the majority and minority groups is 
not reflected in the two randomly equivalent samples used to obtain the 
baseline statistics. Since group ability distributions can influence
the quality of item bias statistics (see for example, Shepard et al., 
1984; Wilson-Burt, Fitzmartin, & Skaggs, 1986), failure to incorporate 
this information in the analysis could reduce the usefulness of the 
distribution of item bias statistics obtained with the two randomly 
equivalent samples. One solution that is sometimes applied when the 
majority group is large involves selecting an examinee sample from the 
majority group to approximate the distribution of scores in the 
minority group (see, for example, Shepard et al., 1984). On the other 
hand, such ability differences and other unique features of the 
majority and minority samples can be incorporated into a 
computer-simulated item bias analysis regardless of the available 
sample sizes. For this reason, our research centered on the potential
value of computer-simulation techniques for providing the desired
baseline distributions.

Method 

Choice of Item Bias Statistics 

Three popular item bias statistics were chosen for the investigation:
area method, root mean squared difference method, and the
Mantel-Haenszel method.

Area Method. In the Area Method, or Total Area Method as it is 
sometimes called, the area between item characteristic curves for the 
same item obtained in the majority and minority groups over a specified 
interval on the ability scale (-3 to +3, in this study) is used as an 
estimate of item bias (Rudner, Getson, & Knight, 1980). An item is
labeled as "potentially biased" when the area between the two curves is 
large. 
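
As a rough illustration, the area statistic can be approximated numerically from two sets of three-parameter item estimates. The sketch below is not the authors' program: the scaling constant D = 1.7, the trapezoidal rule, the use of the unsigned (absolute) difference, and the function names are all assumptions. The same grid of curve differences also yields the root mean squared difference statistic described in the next paragraph, so both are returned.

```python
import math

D = 1.7  # logistic scaling constant; assumed, not stated in the paper

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def area_and_rmsd(params_1, params_2, lo=-3.0, hi=3.0, step=0.01):
    """Approximate the area between two ICCs and their root mean squared
    difference on a grid of ability values from lo to hi."""
    n_points = int(round((hi - lo) / step)) + 1
    diffs = []
    for i in range(n_points):
        theta = lo + i * step
        diffs.append(icc_3pl(theta, *params_1) - icc_3pl(theta, *params_2))
    # Trapezoidal rule applied to the absolute difference between the curves.
    abs_d = [abs(d) for d in diffs]
    area = step * (sum(abs_d) - 0.5 * (abs_d[0] + abs_d[-1]))
    # Square root of the average squared difference over the grid points.
    rmsd = math.sqrt(sum(d * d for d in diffs) / n_points)
    return area, rmsd

# Hypothetical (a, b, c) estimates for one item in the two groups.
area, rmsd = area_and_rmsd((1.0, 0.0, 0.2), (1.0, 0.5, 0.2))
```

Identical curves give zero for both statistics; the further apart the two curves lie over the interval, the larger both values become.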

Root Mean Squared Difference Method. In applying this method 
(Linn, Levine, Hastings, & Wardrop, 1981), the squared difference
between the majority and minority item characteristic curves, at fixed 
intervals (usually .01) is calculated. These squared differences are 
calculated over the interval on the ability scale which is of interest. 
Finally, an average of the squared differences is calculated and the 
square root of the average is taken. Again, large-valued statistics 
reflect substantial differences between item characteristic curves, and 
items associated with large-valued statistics are labeled as 
"potentially biased." 

Mantel-Haenszel Method. The Mantel-Haenszel statistic has 
generated considerable interest among test developers in recent years
because it appears to provide a quick, cheap, and valid indicator of
item bias (Holland & Thayer, 1986). Unlike the other two methods, this
method does not involve the application of IRT models and principles.
In essence, the method first matches examinees on a criterion variable,
often the overall test score because of convenience. The ratio of the
odds for success of the majority and minority group members is
calculated in each score group of interest (with n items, there are n+1
possible score groups). Each ratio is weighted by the sample size in
the score group and then the ratios for the (up to) n+1 score groups
are combined to obtain the Mantel-Haenszel statistic. When the odds
for success on an item in the majority and minority groups among
examinees of the same ability level are substantially different, item
bias is suspected. The advantage of this method over the other two
described above is that the statistic has a known sampling distribution
(chi-square with one degree of freedom) and so meaningful cutoff scores
can be established. This statistic was considered because of the
substantial interest in its use in item bias work.
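
The paper does not write out the formula it used. The sketch below assumes the usual Mantel-Haenszel chi-square with continuity correction, computed from one 2 x 2 table (group by right/wrong) per matched score group. The function name and the tuple layout are illustrative only.

```python
def mantel_haenszel_chi2(tables):
    """Mantel-Haenszel chi-square (1 df) with continuity correction.

    tables: one (A, B, C, D) tuple per matched score group, where
    A, B = majority-group right/wrong counts and
    C, D = minority-group right/wrong counts.
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue  # a score group this small carries no information
        n_maj, n_min = a + b, c + d
        m_right, m_wrong = a + c, b + d
        sum_a += a                                   # observed majority successes
        sum_e += n_maj * m_right / t                 # expected under no bias
        sum_v += n_maj * n_min * m_right * m_wrong / (t * t * (t - 1))
    if sum_v == 0:
        return 0.0
    # Clamp at zero so a tiny discrepancy is not inflated by the correction.
    corrected = max(0.0, abs(sum_a - sum_e) - 0.5)
    return corrected ** 2 / sum_v

# Hypothetical counts for a single score group.
chi2 = mantel_haenszel_chi2([(10, 5, 5, 10)])
```

Under the theoretical chi-square distribution with one degree of freedom, the .05 critical value is about 3.84; the study compares its empirical cut-offs against that kind of reference point.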

Description of the Test Data and Examinee Sample

The test data used in the study were the item responses of 937 
Cleveland ninth grade students to 75 items on the 1985 Cleveland 
Reading Competency Test. There were 207 Whites and 730 Blacks; and 451 
Males and 486 Females in the sample. Because of the very small number 
of Whites in the sample, only a sex bias study was carried out. 

Generation of Simulated Examinee Item Responses

Basically, the approach was to simulate examinee item response 
data that reflected as closely as possible the actual examinee and item
data of interest but without any item bias. Item parameter and ability
parameter estimates obtained from the combined group three-parameter
logistic model analysis were treated as "true values" and then a
simulated set of item responses for the 937 examinees was generated
using the three-parameter logistic model (Hambleton & Rovinelli, 1973).
In this way, the simulated item responses were generated to be
consistent with the item and ability parameter estimates obtained with 
the real data, but without bias. There was no bias because male and 
female item response data were generated from a common set of 
three-parameter item characteristic curves obtained from the analysis 
of the total set of test data. Any differences in ability scores 
between the majority and minority groups were retained because the
ability estimates obtained from the analysis of the real data were used
in the simulations. A parallel set of item bias analyses was carried
out on the real and simulated data. Differences in the distributions 
of item bias statistics would arise if bias were present in the real 
data since in all other respects the datasets were equivalent, assuming 
of course that the three-parameter logistic model provided an 
appropriate fit to the real data. For this reason, the fit of the 
three-parameter logistic model to the test data was checked carefully 
(Hambleton & Rogers, in press; Hambleton & Swaminathan, 1985). 
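
A minimal sketch of this data generation step follows. It is not the program cited above: the scaling constant D = 1.7, the seed, and the function names are assumptions, and the parameter values in the usage lines are invented for illustration.

```python
import math
import random

D = 1.7  # logistic scaling constant; assumed, not stated in the paper

def p_correct(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, items, seed=0):
    """Return a 0/1 response matrix (examinees x items) with no item bias:
    every examinee, whatever the group, is scored against the same
    (a, b, c) item parameters."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_correct(theta, a, b, c) else 0
             for (a, b, c) in items]
            for theta in thetas]

# Invented ability estimates and item parameters, treated as "true values".
thetas = [-1.0, 0.0, 1.5]
items = [(1.0, 0.0, 0.2), (0.8, -0.5, 0.2)]
data = simulate_responses(thetas, items)
```

Group labels never enter the generating function, so any difference between simulated male and female item statistics can arise only from their ability estimates and from sampling error.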

Procedure

With the actual and simulated test data in hand, three sets of 
analyses were carried out: The first analysis was intended to evaluate 
the merits of computer simulated baseline sampling distributions of 
item bias statistics. This analysis involved the comparison of 

statistics in the M1 and F1, and M2 and F2 samples in the
simulated data could be calculated for the purpose of
producing a sampling distribution of each item bias statistic
of interest under the hypothesis of no bias. M1 and F1, and
M2 and F2 comparisons were preferred to the M1 and M2, F1 and
F2 comparisons because the former subgroups reflected any real
ability differences in the Male and Female samples whereas the
latter subgroups did not.

2. Separate modified three-parameter model analyses of the M1,
M2, F1, and F2 real and simulated data were carried out. The
c parameter was fixed at a value of .20. Eight IRT analyses,
in all, were carried out. Ability estimates obtained from the
combined group analysis were also fixed in these analyses.

3. After the necessary data rescalings, two of the item bias 
statistics of interest - Area and Root Mean Squared 
Difference - were calculated for the group comparisons listed 
below. The Mantel-Haenszel statistics were calculated using 
the item response data provided at step 1. 

Real Data

a. M1 vs F1
b. M2 vs F2 (this analysis served as a replication
of the study with the M1 and F1 samples)
c. M1 vs M2
d. F1 vs F2
e. M vs F

Simulated Data

f. M1 vs F1
g. M2 vs F2
h. M vs F

4. For each item bias statistic, the following distributions were
obtained:

Real Data

a. the combined distribution of M1 vs M2 and F1 vs F2 item
bias statistics. (This distribution served as the
baseline for interpreting the real item bias statistics
obtained from the M1 vs F1 and M2 vs F2 comparisons.)

b. the distributions of the M1 vs F1, and the M2 vs F2
item bias statistics. (The M2 vs F2 comparison
served as a replication of the M1 vs F1 comparison.)

Simulated Data

c. the combined distribution of M1 vs F1 and M2 vs F2 item
bias statistics. (This distribution served as the
alternate baseline for interpreting the real item
bias statistics obtained from the M1 vs F1, and M2
vs F2 groups.) This distribution was compared to
4(a) obtained above to assess the viability of the
computer-generated sampling distributions of item
bias statistics.

5. The distributions obtained in step 4 (except for the real M
vs F comparison) were smoothed by the method of "weighted
rolling averages" (Kendall & Stuart, 1968) to remove some of
the minor irregularities in the distributions.

6. The cut-off score corresponding to the .05 level of significance
for each distribution (real and simulated) generated
under the hypothesis of no bias was determined.

7. The cut-off scores obtained at step 6 were applied to the real 
item bias statistics to compare their effects. 
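
Steps 5 to 7 can be sketched in a few lines of Python. The weights actually used for the rolling averages are not reported, so the 1-2-1 weights below are an assumption, as are the function names, the handling of the end intervals, and the choice to report an interval mid-point rather than an interpolated cut-off.

```python
def smooth_frequencies(freqs, weights=(1, 2, 1)):
    """Weighted rolling average of histogram frequencies. At the ends, only
    the weights that fall inside the histogram are used, renormalised."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(freqs)):
        num = den = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(freqs):
                num += w * freqs[j]
                den += w
        smoothed.append(num / den)
    return smoothed

def cutoff_from_histogram(midpoints, freqs, alpha=0.05):
    """First interval mid-point at which the cumulative share of the
    null distribution reaches 1 - alpha."""
    total = sum(freqs)
    running = 0.0
    for mid, f in zip(midpoints, freqs):
        running += f
        if running >= (1.0 - alpha) * total - 1e-9:  # tolerance for rounding
            return mid
    return midpoints[-1]

def flag_items(real_stats, cutoff):
    """Indices of items whose real-data statistic exceeds the cut-off."""
    return [i for i, s in enumerate(real_stats) if s > cutoff]

# Invented null histogram and real-data statistics, for illustration only.
midpoints = [0.1, 0.3, 0.5, 0.7]
null_freqs = [50, 30, 15, 5]
cutoff = cutoff_from_histogram(midpoints, null_freqs)
flagged = flag_items([0.2, 0.6, 0.5], cutoff)
```

In an application, `null_freqs` would first be passed through `smooth_frequencies`, and the cut-off would then be read from the smoothed frequencies.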

In a final phase of the research, the IRT computer simulation 
method was used to provide a baseline distribution for interpreting 
item bias statistics obtained in the full Male and Female samples. 

Results 

Model-Data Fit 

The results from this study would have been meaningless unless the 
three-parameter logistic model had at least provided an adequate 
accounting of the actual item response data. Fortunately, the model 
fit the test data well. The average residual (actual performance
minus expected performance, assuming model-data fit) was .01.
This average was based on 12 comparisons (at ability levels -2.75,
-2.25, ..., 2.75) of the observed and expected performance for each of
the 75 items in the test. Clearly, there was no overall bias in the
fit of the item and ability parameter estimates to the test data. The
average absolute residual calculated at each of the same ability levels
across the 75 items was also very small. It exceeded a value of .05 at
four ability levels, -2.75, -2.25, -1.75, and 2.75, where the combined
examinee sample was only 71 (about 7.5% of the total sample). In sum,
the goodness-of-fit results indicated a close fit between the
best-fitting model and the actual test data.
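
A residual check of this kind can be sketched as follows. The interval width of .5 follows from the spacing of the 12 ability levels. Averaging the model probabilities over the examinees in each interval (rather than evaluating the curve at the interval mid-point), the constant D = 1.7, and the function names are assumptions.

```python
import math

D = 1.7  # logistic scaling constant; assumed, not stated in the paper

# The 12 ability levels: -2.75, -2.25, ..., 2.75
CENTERS = [-2.75 + 0.5 * k for k in range(12)]

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def residuals_by_level(thetas, responses, item, centers=CENTERS, width=0.5):
    """Observed minus expected proportion correct for one item in each
    ability interval. Empty intervals are skipped."""
    a, b, c = item
    out = {}
    for center in centers:
        members = [i for i, t in enumerate(thetas)
                   if center - width / 2 <= t < center + width / 2]
        if not members:
            continue
        observed = sum(responses[i] for i in members) / len(members)
        expected = sum(p_3pl(thetas[i], a, b, c) for i in members) / len(members)
        out[center] = observed - expected
    return out

# Two invented examinees at theta = 0, one right and one wrong.
res = residuals_by_level([0.0, 0.0], [1, 0], (1.0, 0.0, 0.2))
```

Averaging the signed residuals over items and levels shows whether the model systematically over- or under-predicts; averaging their absolute values shows how large the misfit is at each level.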

Comparison of the Real and Simulated Null Distributions

Tables 1, 2, and 3 provide the smoothed distributions under the
hypothesis of no bias for the three item bias statistics with both real
and simulated data. Figures 1, 2, and 3 highlight the same
information in graphical form. The results are clear: There was very
little difference between the sampling distributions of the item bias
statistics generated with real and simulated data. The maximum
difference in the sampling distributions with real and simulated data
was 7.8%. Also, the largest differences were always observed in the
lower halves of the sampling distributions where the consequences of
differences on the determination of cut-off values were small.

Insert Tables 1, 2, and 3 and Figures 1, 2, and 3 about here

Effect of Choice of Sampling Distribution

Perhaps the best way to judge the effects of choosing the
simulated over the real distributions of item bias statistics under the
hypothesis of no bias is in terms of the practical consequences of
different cut-off scores derived from the two distributions. Table 4
provides the .05 cut-off score for the real and simulated distributions
for each item bias statistic under the hypothesis of no bias. These
cut-off scores were then applied to the M1 vs F1 and M2 vs F2 real item
bias data. The smoothed simulated distribution without bias and the
smoothed real distribution of item bias statistics for the M1 vs F1
item bias study are shown in Figures 4, 5, and 6.

Insert Table 4 and Figures 4, 5, and 6 about here

Table 4 shows that there were differences in the determination of
cut-off scores with the real and simulated distributions. These
differences influenced the numbers of test items identified at the .05 
level though the influence of choice of distribution appeared to be 
small. Across six comparisons, the average difference was three items. 
In view of the close similarity in the distributions as reflected in 
Figures 1, 2, and 3, it is likely that the differences reflected, to a 
great extent, the instability in determining the .05 cut-off score 
because of the very limited amounts of data in the tails of the 
distributions. Smoothing the simulated distributions was helpful but 
basically the problem remained: there was a limited number of data 
points in the tails of the distributions. In addition, some differences 
in the results were expected because the simulated distributions 
reflected the ability distribution differences in the Male and Female 
samples better than the real null distributions under the hypothesis of 
no bias. 

An Example

Though samples of (approximately) 450 Males and Females were 
available for the research investigation, it was necessary to divide
each sample in half so that various comparisons of results could be
made to evaluate the merits of our computer simulation. In practice, a
test developer would carry out the item bias study with the full set of 
available data. Figures 7, 8, and 9 highlight the results of the item 
bias investigation using the full Male and Female samples, and using 
smoothed computer-simulated distributions of item bias statistics 
without any bias, to provide baseline data for interpreting the 
results. Using the .05 level of significance, the numbers of items in 
need of careful review were obtained. The numbers varied depending on 
the choice of item bias statistic: eight items with the Area method, 
six items with the Root Mean Squared Difference method, and 20 items 
with the Mantel-Haenszel method. 



Insert Figures 7, 8, and 9 about here 



Conclusions 

The results of this study reported in Tables 1 to 3 and Figures 1 
to 3 provided support for the use of simulated data to establish 
critical values for IRT item bias statistics. When the test data fit 
the model chosen, use of the IRT parameter estimates to generate data 
allows the user to simulate samples closely resembling the original 
data but under conditions of no bias. Though the results in these 
tables and figures do not provide much evidence of the importance of 
retaining ability differences in the simulations of majority and 
minority group performance, nevertheless, preserving these differences 
to enhance the validity of the simulated sampling distributions seems
desirable. Given the practical limitations of IRT parameter
estimation, particularly in relatively small samples, retaining these 
ability distribution differences may be important, since they may 
affect the IRT item bias statistics. When randomly equivalent samples 
of the real data are used to establish cutoff values for the bias 
statistics, this consideration is not taken into account. Hence 
simulating the ability differences under conditions of no bias allowed 
us to set more realistic cutoff values for the bias statistics. 
Certainly, nothing was lost with the procedure and there may be 
circumstances in practice where there is considerable merit to the use 
of simulated distributions of item bias statistics. 

In the present study, taking ability distribution differences, 
though slight, into account, produced higher cutoff values with 
IRT-based methods than were obtained using random samples of the real 
data, resulting in the flagging of fewer items as biased. Given that 
our groups were Males and Females, and no substantial bias was 
expected, the direction of the observed differences supports the use of 
simulated data to establish cut-off points for the IRT item bias 
statistics. 

The lack of agreement observed between the two replications of 
the bias analysis in the real data (see Table 4) highlights the problem 
of using IRT methods in small samples. This leads us to caution 
against using any firm cut-off score for the bias statistics. We 
recommend that the simulated data baseline be used more to give a sense 
of what is extreme in the values of the bias statistics than to label 
an item as potentially biased or not. Smoothing distributions
definitely reduced the problem of unstable cut-off points; however, we
would still recommend that precise cut-off scores not be used.

The results for the Mantel-Haenszel statistic suggest that while
data can be generated which will return IRT parameter estimates similar 
to those obtained from the real data, it is more difficult to generate 
response patterns which closely resemble the real data. Hence the 
method proposed here of simulating data to obtain baseline values may 
not be useful for bias statistics which are not derived from IRT 
models. 

In summary, application of the IRT computer simulation method for
generating baseline distributions of item bias statistics is as 
follows: 

1. Choose an IRT model and estimate item and ability parameters 
for the total group of examinees. Assess model-data fit. 
Continue with the method if the model-data fit is acceptable. 
Choose a new, better fitting model, otherwise, and repeat this 
step. Items which are suspected of being biased can be 
removed from the analysis at this step. Removal of items does 
not seem necessary unless the number of items suspected of 
being biased is a significant portion of the total number of 
items in the test (e.g., 10% or more). 

2. Treat the item and ability parameter estimates as "true" 
values and generate a new set of examinee item responses using 
the logistic model of choice in step 1 (see, for example, 
Hambleton & Rovinelli, 1973).

3. Split the simulated examinee item responses into the majority
and minority groups of interest and re-estimate the item 
parameters, treating ability scores obtained at step 1 as 
fixed. (Fixing the ability scores serves two purposes: item 
parameter estimation time is reduced substantially, and 
scaling problems with the data are considerably reduced.) 

4. Choose the IRT item bias statistic (or statistics) of interest 
and carry out the necessary calculations on the ICCs and 
ability estimates for the simulated majority and minority test 
data. 

5. Produce the sampling distribution of the item bias statistics 
obtained from the simulated data, and smooth the distribution 
of resulting item bias statistics to remove some of the 
instability in determining cut-off scores. Determine the 
cut-off value corresponding to the 95th percentile (and/or 
other cut-off values of interest). 

6. Repeat steps 3 and 4 with the real test data. 

7. Interpret the item bias statistics obtained with the real test
data using the cut-off values obtained from the simulated test 
data. 
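
Step 3 (re-estimating item parameters with ability scores held fixed) can be illustrated with a deliberately simple stand-in. The sketch below uses a coarse grid search over the a and b parameters with c fixed at .20, as in the study; the grid search itself, the grid ranges, the constant D = 1.7, and the function names are assumptions, and an IRT estimation program would be used in practice.

```python
import math

D = 1.7  # logistic scaling constant; assumed, not stated in the paper

def log_likelihood(a, b, c, thetas, responses):
    """Log-likelihood of one item's responses given fixed ability scores."""
    total = 0.0
    for theta, u in zip(thetas, responses):
        p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total

def estimate_item(thetas, responses, c=0.20):
    """Grid-search estimates of a and b for one item, with c and the
    ability scores treated as fixed."""
    a_grid = [0.2 + 0.1 * i for i in range(24)]    # 0.2, 0.3, ..., 2.5
    b_grid = [-3.0 + 0.1 * j for j in range(61)]   # -3.0, -2.9, ..., 3.0
    best = max((log_likelihood(a, b, c, thetas, responses), a, b)
               for a in a_grid for b in b_grid)
    return best[1], best[2]
```

Running `estimate_item` separately on the majority and minority responses to the same item gives the two sets of item parameters from which the item characteristic curves in step 4 are compared.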

Test developers who carry out the seven steps above should be in a 
position to interpret their item bias statistics in a meaningful way. 
Our view remains, however, that while simulated sampling distributions 
can be very useful when interpreting actual item bias statistics,
because of the instability of determining cut-off scores as well as the
arbitrariness of the choice of cut-off scores, judgment must still be
used in making sensible use of item bias statistics.

Table 1

Distribution of the Item Area Statistics Under the
Hypothesis of No Bias

[Table body not recoverable from the scanned source.]


Table 2

Distribution of the Item Root Mean Squared Difference
Statistics Under the Hypothesis of No Bias

Interval       Real Data    Simulated Data    Cum%
(Mid-Point)    Cum%         Cum%              Difference

.0025            0.6           0.6            0.0
.0075            2.2           0.8            1.4
.0125            4.6           3.6            1.0
.0175            7.8           8.4            0.6
.0225           11.3          14.4            3.1
.0275           15.3          20.5            5.2
.0325           20.6          27.5            6.9
.0375           27.4          35.2            7.8
.0425           35.6          43.1            7.5
.0475           44.1          51.1            7.0
.0525           51.4          57.5            6.1
.0575           57.8          62.5            4.7
.0625           63.4          67.0            3.6
.0675           68.7          70.8            2.1
.0725           73.7          73.8            0.1
.0775           78.4          76.8            1.6
.0825           82.6          79.8            2.8
.0875           85.6          82.4            3.2
.0925           88.1          84.7            3.4
.0975           90.7          87.2            3.5
.1025           92.8          89.3            3.5
.1075           94.7          90.7            4.0
.1125           96.5          91.8            4.7
.1175           97.5          92.4            5.1
.1225           97.9          93.0            4.9
.1275           98.1          94.2            3.9
.1325           98.2          95.6            2.6
.1375           98.3          96.9            1.4
.1425           98.5          97.7            0.8
.1475           98.8          98.1            0.7
.1525           99.0          98.1            0.9
.1575           99.2          98.1            1.1
.1625           99.3          98.3            1.0
.1675           99.3          98.5            0.8
.1725           99.4          98.8            0.6
.1775           99.6          99.0            0.6
.1825          100.0          99.2            0.8


Table 3

Distribution of the Item Mantel-Haenszel Statistics
Under the Hypothesis of No Bias

Interval       Real Data    Simulated Data    Cum%
(Mid-Point)    Cum%         Cum%              Difference

 .1             20.7          25.2            4.5
 .3             42.6          49.1            6.5
 .5             58.1          65.3            7.2
 .7             65.9          72.6            6.7
 .9             69.8          74.7            4.9
1.1             74.6          77.6            3.0
1.3             79.3          80.1            0.8
1.5             83.3          82.7            0.6
1.7             86.4          85.5            0.9
1.9             88.6          87.9            0.7
2.1             90.1          89.2            0.9
2.3             91.1          90.0            1.1
2.5             92.1          91.4            0.7
2.7             93.2          93.2            0.0
2.9             93.7          94.9            1.2
3.1             94.2          96.1            1.9
3.3             94.9          96.9            2.0
3.5             95.7          97.1            1.4
3.7             96.5          97.2            0.7
3.9             97.4          97.3            0.1
4.1             97.9          97.3            0.6
4.3             98.0          97.4            0.6
4.5             98.0          97.5            0.5
4.7             98.1          97.6            0.5
4.9             98.1          97.9            0.2
5.1             98.1          98.1            0.0
5.3             98.2          98.4            0.2
5.5             98.3          98.5            0.2
5.7             98.5          98.7            0.2
5.9             98.7          98.7            0.0
6.1             98.7          98.7            0.0
6.3             98.8          98.7            0.1
6.5             99.1          98.8            0.3
6.7             99.5          99.1            0.4
6.9            100.0          99.5            0.5


Table 4

Choice of Distribution (Real or Simulated) on the
Determination of Cut-off Scores and Identification
of Potentially Biased Test Items

                     Real Null Distribution        Simulated Null Distribution
                     Critical     Biased           Critical       Biased
Bias Statistic       Value¹       Items²           Value          Items              Difference

Area                 .544         4 (11)           [illegible]    [illegible] (6)    3 (5)
Root Mean Squared
Difference           .113         4 (10)           .134           3 (3)              1 (7)
Mantel-Haenszel      3.42         6 (19)           3.03           6 (21)             0 (2)

¹ At the .05 level.

² The numbers in brackets correspond to the numbers of test items identified as potentially biased
in a replication of the study with second male and female samples.



Figure 1. A comparison of the simulated and real sampling distributions of the Item
Area Statistics under the hypothesis of no bias.

Figure 2. A comparison of the simulated and real sampling distributions of the Item Root
Mean Squared Difference Statistics under the hypothesis of no bias.

Figure 3. A comparison of the simulated and real sampling distributions of the Item
Mantel-Haenszel Statistics under the hypothesis of no bias.

Figure 4. A comparison of the distribution of Item Area Statistics for the male and female
groups with the smoothed distribution of the same statistic for the simulated
male and female groups under the hypothesis of no bias.

Figure 5. A comparison of the distribution of Item Root Mean Squared Difference Statistics
for the male and female groups with the smoothed distribution of the same
statistic for the simulated male and female groups under the hypothesis of no bias.

Figure 6. A comparison of the distribution of Mantel-Haenszel Statistics for the male and
female groups with the smoothed distribution of the same statistic for the simulated
male and female groups under the hypothesis of no bias.

Figure 7. A comparison of the distribution of Item Area Statistics for the total sample
male and female groups with the smoothed distribution of the same statistic for
the total sample simulated male and female groups under the hypothesis of no bias.

Figure 8. A comparison of the distribution of Item Root Mean Squared Differences for the
total sample male and female groups with the smoothed distribution of the same
statistic for the total sample simulated male and female groups under the
hypothesis of no bias.

Figure 9. A comparison of the distribution of Mantel-Haenszel Statistics for the total
sample male and female groups with the smoothed distribution of the same statistic
for the total sample simulated male and female groups under the hypothesis of
no bias.




